A Collaborative Approach to Computational Reproducibility
Although a standard in natural science, reproducibility has been only
episodically applied in experimental computer science. Scientific papers often
present a large number of tables, plots and pictures that summarize the
obtained results, but then loosely describe the steps taken to derive them. Not
only can the methods and the implementation be complex, but also their
configuration may require setting many parameters and/or depend on particular
system configurations. While many researchers recognize the importance of
reproducibility, the challenge of making it happen often outweighs the benefits.
Fortunately, a plethora of reproducibility solutions have been recently
designed and implemented by the community. In particular, packaging tools
(e.g., ReproZip) and virtualization tools (e.g., Docker) are promising
solutions towards facilitating reproducibility for both authors and reviewers.
To address the incentive problem, we have implemented a new publication model
for the Reproducibility Section of Information Systems Journal. In this
section, authors submit a reproducibility paper that explains in detail the
computational assets from a previously published manuscript in Information
Systems
The PBase Scientific Workflow Provenance Repository
Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other advantages that altogether facilitate "reproducible science". In this context, provenance (information about the origin, context, derivation, ownership, or history of some artifact) plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, in order to perform such analyses on scientific results as part of extended research collaborations, an adequate environment and tools are required. Concretely, the need arises for a repository that will facilitate the sharing of scientific workflows and their associated execution traces in an interoperable manner, also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind, we introduce PBase: a scientific workflow provenance repository implementing the ProvONE proposed standard, which extends the emerging W3C PROV standard for provenance data with workflow-specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user-friendly interface tailored for the visualization of scientific workflow provenance data, making the specification of queries and the interpretation of their results easier and more effective
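As a rough illustration of the kind of lineage query a provenance repository supports, the sketch below walks a tiny in-memory graph whose edges borrow relation names from W3C PROV (used, wasGeneratedBy). It is not PBase, ProvONE, or Neo4j code; the artifact names, run identifiers, and the lineage function are all hypothetical and chosen only for this example.

```python
from collections import deque

# Toy provenance edges as (subject, relation, object) triples.
# Relation names follow W3C PROV; every identifier is hypothetical.
EDGES = [
    ("plot.png", "wasGeneratedBy", "run:render"),
    ("run:render", "used", "table.csv"),
    ("table.csv", "wasGeneratedBy", "run:clean"),
    ("run:clean", "used", "raw.csv"),
    ("run:clean", "used", "params.json"),
]


def lineage(artifact, edges=EDGES):
    """Return every node the artifact transitively depends on, in BFS order."""
    successors = {}
    for subject, _relation, obj in edges:
        successors.setdefault(subject, []).append(obj)

    seen = {artifact}
    order = []
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Lists the runs and inputs that plot.png was derived from.
    print(lineage("plot.png"))
```

A graph database would express the same traversal declaratively as a variable-length path query; the breadth-first search here only makes the underlying idea explicit.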
HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset
This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano in (2015, 2016) [56–58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML) based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks in the current semantic measures libraries, especially the performance and scalability, as well as the evaluation of new methods and the replication of most previous methods. The reproducible experiments introduced herein are encouraged by the lack of a set of large, self-contained and easily reproducible experiments with the aim of replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments.
PosetHERep proposes a memory-efficient representation for taxonomies that scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research into the area by providing a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability can be significantly improved without caching by using PosetHERep
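To make "ontology-based semantic similarity measure" concrete, here is a minimal sketch of one widely known taxonomy-based measure, Wu-Palmer similarity, on a toy taxonomy. It does not use HESML or PosetHERep and does not reflect their API. The taxonomy, the function names, and the simplifying assumption that each concept has a single parent are all made up for illustration.

```python
# Toy single-inheritance taxonomy: child -> parent (None marks the root).
# The concepts are hypothetical; real taxonomies such as WordNet are far
# larger and allow multiple parents.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}


def ancestors(concept):
    """Return the path from a concept up to the root, both inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(ancestors(b))
    # Walking up from a, the first shared ancestor is the deepest one.
    lcs = next(c for c in ancestors(a) if c in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wu_palmer("dog", "cat"))
    print(wu_palmer("dog", "plant"))
```

Here dog and cat share the deeper subsumer animal, so they score higher than dog and plant, whose only common ancestor is the root.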
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts
Scientific workflow management systems offer features for composing complex
computational pipelines from modular building blocks, for executing the
resulting automated workflows, and for recording the provenance of data
products resulting from workflow runs. Despite the advantages such features
provide, many automated workflows continue to be implemented and executed
outside of scientific workflow systems due to the convenience and familiarity
of scripting languages (such as Perl, Python, R, and MATLAB), and to the high
productivity many scientists experience when using these languages. YesWorkflow
is a set of software tools that aim to provide such users of scripting
languages with many of the benefits of scientific workflow systems. YesWorkflow
requires neither the use of a workflow engine nor the overhead of adapting code
to run effectively in such a system. Instead, YesWorkflow enables scientists to
annotate existing scripts with special comments that reveal the computational
modules and dataflows otherwise implicit in these scripts. YesWorkflow tools
extract and analyze these comments, represent the scripts in terms of entities
based on the typical scientific workflow model, and provide graphical
renderings of this workflow-like view of the scripts. Future versions of
YesWorkflow also will allow the prospective provenance of the data products of
these scripts to be queried in ways similar to those available to users of
scientific workflow systems
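The annotation idea described above can be sketched as follows. The embedded script uses YesWorkflow-style comment tags (@begin, @end, @in, @out), but the extractor is a toy written for this example, not the YesWorkflow tool or its implementation. The block names, data names, and both functions are hypothetical.

```python
import re

# A script annotated with YesWorkflow-style comments. It is held in a
# string and only parsed here, never executed.
SCRIPT = '''\
# @begin clean_data
# @in raw_file
# @out clean_table
rows = [line.strip() for line in open("raw.csv")]
# @end clean_data

# @begin plot_results
# @in clean_table
# @out figure
print(len(rows))
# @end plot_results
'''

# Matches comment lines such as "# @in raw_file".
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_blocks(source):
    """Collect each annotated block with its declared inputs and outputs."""
    blocks, current = [], None
    for line in source.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end" and current is not None:
            blocks.append(current)
            current = None
        elif current is not None and tag in ("in", "out"):
            current[tag].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link a producer to a consumer when an output name matches an input."""
    edges = []
    for producer in blocks:
        for consumer in blocks:
            for name in producer["out"]:
                if name in consumer["in"]:
                    edges.append((producer["name"], name, consumer["name"]))
    return edges


if __name__ == "__main__":
    for edge in dataflow_edges(extract_blocks(SCRIPT)):
        print(edge)
```

Because the annotations live in comments, the script itself runs unchanged; the workflow view is recovered purely from the text, which is what lets the approach stay language-independent.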
311 Dataset
This is a version of the 311 dataset used in the following paper:
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets, F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016
The dataset includes records from 311, a telephone number that provides non-emergency services to New York City, from 2003 to 2014.
The original data is available at the NYC Open Data portal
Weather Dataset
This is a version of the weather dataset used in the following paper:
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets, F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016
The dataset includes records of weather data for New York City, from 2010 to 2014.
The original data is available at the National Climatic Data Center website: http://www7.ncdc.noaa.gov/CDO/dataproduct
Preserving and Reproducing Research with ReproZip
Introduction to using ReproZip to aid in reproducing research